Abstract

High-resolution depth maps can be inferred from low-resolution depth measurements and an additional
high-resolution intensity image of the same scene. To that end, we introduce a bimodal co-sparse analysis
model, which is able to capture the interdependency of registered intensity and depth information. This
model is based on the assumption that the co-supports of corresponding bimodal image structures are
aligned when computed by a suitable pair of analysis operators. No analytic form of such operators exists,
and we propose a method for learning them from a set of registered training signals. This learning process
is done offline and returns a bimodal analysis operator that is universally applicable to natural scenes.
We use this to exploit the bimodal co-sparse analysis model as a prior for solving inverse problems, which
leads to an efficient algorithm for depth map super-resolution.

1 Introduction

Many technical applications in fields like robotics, 3D video rendering, or human-computer interaction are
built upon precise knowledge of the surrounding 3D environment. This information is typically acquired
via either passive or active range sensors. Passive range sensing, i.e. 3D from stereo intensity images,
is essentially based on three steps. First, ambient light that is reflected from the same object surfaces
is captured at multiple displaced views. Second, the disparities of corresponding light intensity samples
between the different views are determined. Third, the distance to the sensor is obtained using the computed
disparities together with the knowledge of the relative positions between all views. Despite very active
research in this area and significant improvements over the past years, stereo methods still struggle with
noise, texture-less regions, repetitive texture, and occluded areas. For an overview of stereo methods, the
reader is referred to [ ].

Active sensors, on the other hand, emit light and either measure the time-of-flight of a modulated ray,
e.g. LIDAR or PMD, or capture the reflection pattern of a structured light source to infer the distance to
objects, as is done for example by the well-known Microsoft Kinect. Such sensors are becoming increasingly
popular because they acquire reliable depth measurements independent of the occurring texture and are
real-time capable. However, their main drawbacks are that the acquired depth maps are of low resolution
(LR) and corrupted by noisy and missing values. To overcome these limitations, different methods for
upsampling and denoising LR depth maps from range sensors have been proposed, see Section 2.

Figure 1: Top row: color image (left) and corresponding registered depth map (right) recorded by the Kinect
sensor. Bottom row: a 3D rendering of the tiger head detail visualizing the difference between the original
sensor data (left) and the refined version using our proposed method (right).

The fact that both ambient and artificially emitted light are reflected by the same object surfaces
suggests a co-occurrence of signal patterns in the depth map obtained by an active range sensor and in a
corresponding registered camera intensity image. Indeed, some of the most successful
methods for reconstructing and refining depth maps aim at exploiting this statistical dependency.

In this paper, we introduce a joint intensity and depth (JID) co-sparse analysis model that exploits the
dependencies between the two modalities. This model is based on the assumption that the co-supports
of corresponding structures are aligned when computed by a suitable pair of analysis operators. To that
end, we propose a method for learning the required bimodal analysis operator from aligned training data.
This procedure is done only once and offline, and results in a universally applicable operator, which is
valid for all intensity and depth pairs of natural scenes. This operator, together with a high-resolution (HR)
intensity image, is employed for reconstructing an HR depth map that corresponds to the HR intensity image.
The problem is considered as a linear inverse problem, which is regularized using the bimodal analysis
operator. Our numerical experiments show that our method compares favorably to state-of-the-art methods
both visually and quantitatively, and they underpin the validity of our proposed joint intensity and depth
data model. In summary, the two main contributions of this paper are:



• A new bimodal co-sparse analysis model that reflects the dependencies between properly aligned
intensity and depth samples from the same scene.

• An algorithm for simultaneous depth map super-resolution (SR) and inpainting of missing depth
values, which exploits the introduced data model and can cope with various noise models.

2 Related Work 

Increasing the resolution of depth images obtained from range sensors has become an important research
topic, and diverse approaches treating this problem have been proposed throughout the past years. Many
of these methods originate from the closely related problem of intensity image super-resolution. However,
these mostly aim at producing visually pleasing results, which is different from the goal of achieving
geometrically sound depth maps. Straightforward upsampling methods like nearest-neighbor, bilinear, or
bicubic interpolation produce undesirable staircasing or blurring artifacts, see Figure 2. Here, we briefly
review more sophisticated methods for depth map SR that aim at reducing these artifacts.

In a first attempt, methods have been proposed that use smoothing priors from edge statistics [ ] or local
self-similarities [10]. These methods only require a single image, but either have difficulties in textured
areas, or only work well for small upscaling factors. A different approach, which also solely requires depth
information, is based on fusing multiple displaced LR depth maps into a single HR depth map. Schuon et
al. [23] develop a global energy optimization framework employing data fidelity and geometry priors. This
idea is extended for better edge preservation by Bhavsar et al. in [ ].

A number of recently introduced methods aim at exploiting co-aligned discontinuities in intensity and
depth images of the same scene. They fuse the HR and LR data utilizing Markov Random Fields (MRF).
Depth map refinement based on MRF was first explored in [ ], extended in [ ] with a depth-specific
data term, and combined with depth from passive stereo in [29]. In order to better preserve local structures
and to remove outliers, Park et al. [ ] add a non-local means term to their MRF formulation. Aodha et al.
[ ] treat depth SR as an MRF labeling problem of matching LR depth map patches to HR patches from a
predefined database.

Inspired by successful stereo matching algorithms, Yang et al. [28] iteratively employ a bilateral filter 
to improve depth SR using an additional HR intensity image. Chan et al. [ ] extend this approach by 
incorporating a noise model specific to depth data. Xiang et al. [ ] include sub-pixel accuracy, and Dolson 
et al. [7] address temporal coherence across a depth data stream from LIDAR scanners by combining a 
bilateral filter with a Gaussian framework. 

Finally, methods exist that exploit the dependency between sparse representations of intensity and depth 
signals over appropriate dictionaries. In [ ], the complex wavelet transform is used as the dictionary. Both 
the HR intensity image and the LR depth map are transformed into this domain and the resulting coefficients 
are fused using a dual tree to obtain the HR depth map. Instead of using predefined bases, approaches 
employing learned dictionaries are known to lead to state-of-the-art performance in diverse classical image 
reconstruction tasks, cf. [ , ]. Surprisingly, applying those techniques for depth map enhancement has 
only very recently been explored. Mahmoudi et al. [16] first learn a depth dictionary from noisy samples, 
then refine and denoise these samples and finally learn an additional dictionary from the denoised samples 
to inpaint, denoise, and super-resolve projected depth maps from 3D models. Closest to our approach are 
the recent efforts of [14] and [25]. They independently learn dictionaries of depth and intensity samples
and model a coupling of the two signal types during the reconstruction phase. In [ ], three dictionaries are
composed from LR depth, HR depth, and HR color samples to learn a respective mapping function based 
on edge features. In contrast, only two dictionaries for intensity and depth are learned in [ ], where the 
similarity of the support of corresponding sparse representations is used to model the coupling. 

3 Proposed Approach 

In our approach, we treat the problem of depth map super-resolution as a linear inverse problem. Basically,
the goal is to reconstruct an HR depth map $s \in \mathbb{R}^n$ from a set of measurements $y \in \mathbb{R}^m$
that are possibly corrupted by noise and missing values, i.e. an LR depth map, with $m < n$. Formally, the
relation between $s$ and $y$ is given by

$$y = A s + e, \tag{1}$$

with $A \in \mathbb{R}^{m \times n}$ modeling the sampling process, and $e \in \mathbb{R}^m$ modeling noise and
potential sampling errors. Here, the dimension $m$ of the measurement vector is significantly smaller than the
dimension $n$ of the HR depth map. Consequently, reconstructing $s$ in (1) is highly ill-posed. Using additional
information about the signal's structure helps to tackle this linear inverse problem.
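As a minimal illustration of the measurement model (1), the following sketch builds a sampling matrix that simply keeps every d-th pixel of a small depth map and adds Gaussian noise. The function name `decimation_matrix`, the row-major vectorization, the toy sizes, and the absence of any blur are assumptions made for illustration only; they are not taken from our implementation.

```python
import numpy as np

def decimation_matrix(h, w, d):
    """Return A of shape (m, n) that keeps every d-th pixel of an h x w map.

    The map is assumed to be vectorized in row-major order, so n = h * w.
    """
    rows = np.arange(0, h, d)
    cols = np.arange(0, w, d)
    keep = (rows[:, None] * w + cols[None, :]).ravel()  # indices of kept pixels
    A = np.zeros((keep.size, h * w))
    A[np.arange(keep.size), keep] = 1.0
    return A

rng = np.random.default_rng(0)
h, w, d = 8, 8, 2
s = rng.uniform(0.0, 255.0, size=h * w)      # vectorized HR depth map, n = 64
A = decimation_matrix(h, w, d)               # sampling operator, m = 16
e = rng.normal(0.0, 1.0, size=A.shape[0])    # additive noise
y = A @ s + e                                # LR measurements as in Eq. (1)
```

Since m is much smaller than n, A has a large null space, which is why a prior on s is needed to single out a plausible solution.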

One prior assumption that has proven useful is that the signals of interest allow a sparse representation.
A vector is called sparse when most of its entries are equal to zero or sufficiently small in magnitude. The
co-sparse analysis model [18] assumes that applying an analysis operator $\Omega \in \mathbb{R}^{k \times n}$
with $k \ge n$ to a signal $s \in \mathbb{R}^n$ results in a sparse vector $\Omega s \in \mathbb{R}^k$. If
$g : \mathbb{R}^k \to \mathbb{R}$ denotes a function that measures sparsity, like the $\ell_0$-pseudo-norm,
the analysis model assumption can be exploited to tackle linear inverse problems by solving

$$s^\star \in \arg\min_{s \in \mathbb{R}^n} g(\Omega s) \quad \text{subject to} \quad d_E(A s, y) \le \epsilon, \tag{2}$$

where $d_E$ denotes an appropriate error measure and $\epsilon \in \mathbb{R}_0^+$ is an estimated upper bound
of the noise energy. Typical examples for $d_E$ include the squared Euclidean distance.

Most crucial for the success of the analysis approach is the choice of an appropriate analysis operator. 
Analytic operators, e.g. the finite difference operator, exist. However, using an operator that is learned from 
signal examples is known to yield better performance [12, 19, , ].

Our approach to depth map SR utilizes the interdependency of the two modalities intensity and depth.
In a first step, we describe a new data model and how it can be learned in the form of an analysis operator
pair that incorporates both the signal structure and the corresponding bimodal interdependency. In a second
step, we explain how this learned prior model can be used for HR signal reconstruction.

3.1 Bimodal Co-Sparse Analysis Model 

In the analysis model, the zero entries of the analyzed vector $\Omega s$ determine the signal's structure [ ].
Geometrically, $s$ lies in the intersection of all hyperplanes whose normal vectors are given by the rows of
$\Omega$ indexed by the zero entries of $\Omega s$. This index set is called the co-support of $s$, and is given by

$$\operatorname{cosupp}(\Omega s) := \{ j \mid (\Omega s)_j = 0 \}. \tag{3}$$

Therein, $s$ is a vectorized patch and $(\Omega s)_j$ is the $j$-th entry of the analyzed vector. Now assume that
intensity signals $s_I \in \mathbb{R}^{n_1}$ as well as depth signals $s_D \in \mathbb{R}^{n_2}$ allow a co-sparse
representation with an appropriate pair of analysis operators
$(\Omega_I, \Omega_D) \in \mathbb{R}^{k \times n_1} \times \mathbb{R}^{k \times n_2}$. Based on the knowledge that
a signal's structure is encoded in its co-support (3), we postulate that a pair of analysis operators exists such
that the co-supports of $s_I$ and $s_D$ are statistically dependent if both signals originate from the same scene.
The bimodal co-sparse analysis model assumes that the conditional probability of $j$ belonging to the co-support of
$s_D$ given that $j$ belongs to the co-support of $s_I$ is significantly higher than the unconditional probability, i.e.

$$\Pr\big(j \in \operatorname{cosupp}(\Omega_D s_D) \mid j \in \operatorname{cosupp}(\Omega_I s_I)\big) \gg \Pr\big(j \in \operatorname{cosupp}(\Omega_D s_D)\big). \tag{4}$$

Clearly, this model is idealized, since in practice, the entries of the analyzed vectors are not exactly equal
to zero. In the next section, we explain how the coupled pair of analysis operators $(\Omega_I, \Omega_D)$ can be
jointly learned, such that aligned intensity and depth signals analyzed by these operators adhere to the introduced
model.
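The co-support and the dependency assumed by the model can be illustrated on a toy example. The sketch below uses a one-dimensional forward-difference operator for both modalities, which is a deliberately simple stand-in and not a learned operator (it has k = n - 1 rows only). The tolerance `tol` and the helper names are hypothetical.

```python
import numpy as np

def cosupport(omega, s, tol=1e-8):
    """Indices j with |(omega @ s)_j| <= tol, i.e. the (approximate) co-support."""
    return np.flatnonzero(np.abs(omega @ s) <= tol)

def cosupport_rates(resp_i, resp_d, tol=1e-8):
    """Empirical Pr(j in cosupp_D | j in cosupp_I) and Pr(j in cosupp_D)."""
    in_i = np.abs(resp_i) <= tol
    in_d = np.abs(resp_d) <= tol
    return in_d[in_i].mean(), in_d.mean()

# Toy analysis operator: forward differences on a signal of length n.
n = 8
omega = np.eye(n)[1:] - np.eye(n)[:-1]

# An intensity edge and a depth edge at the same position.
s_intensity = np.array([10., 10., 10., 10., 50., 50., 50., 50.])
s_depth = np.array([2., 2., 2., 2., 7., 7., 7., 7.])

co_i = cosupport(omega, s_intensity)
co_d = cosupport(omega, s_depth)
conditional, unconditional = cosupport_rates(omega @ s_intensity, omega @ s_depth)
```

In this idealized case both co-supports coincide, so the conditional frequency is one while the unconditional frequency is 6/7. For real patches the entries are only approximately zero, which is why a tolerance is used here.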

3.2 JID Analysis Operator Learning 

Generally, the goal of learning an analysis operator can be formulated as follows: Given a set
$\{s^{(i)} \in \mathbb{R}^n\}_{i=1}^M$ of training samples representing the signal class of interest, find an
operator $\Omega \in \mathbb{R}^{k \times n}$ with $k \ge n$ such that all representations $\Omega s^{(i)}$ are
maximally sparse.

Here, we aim at learning the coupled pair of bimodal analysis operators
$(\Omega_I, \Omega_D) \in \mathbb{R}^{k \times n_1} \times \mathbb{R}^{k \times n_2}$ for intensity and depth
signals. To this end, we use a set of $M$ aligned and corresponding training pairs
$\{(s_I^{(i)}, s_D^{(i)}) \in \mathbb{R}^{n_1} \times \mathbb{R}^{n_2}\}_{i=1}^M$. More specifically, these are HR
intensity and HR depth patches representing the same excerpt of a scene. Now, we incorporate the proposed
condition (4) into the learning process by enforcing the zeros of corresponding analyzed vectors
$\Omega_I s_I^{(i)}$, $\Omega_D s_D^{(i)}$ to be at the same positions. Throughout the paper, the function
$x \mapsto \sum_{j=1}^{k} \log(1 + \nu x_j^2)$, with $\nu > 0$ being a positive weight, serves as an appropriate
sparsity measure. Note that any other smooth sparsity measure leads, in principle, to similar results. With this,
the coupled sparsity is controlled through the function

$$g(\Omega_I s_I, \Omega_D s_D) := \sum_{j=1}^{k} \log\Big(1 + \nu \big((\Omega_I s_I)_j^2 + (\Omega_D s_D)_j^2\big)\Big). \tag{5}$$

To find the ideal pair of bimodal operators we minimize the sum of squares of (5), which can be interpreted
as a balanced optimization over the expectation and the variance of the analyzed vectors' sparsity and reads
as

$$G(\Omega_I, \Omega_D) := \frac{1}{M} \sum_{i=1}^{M} g\big(\Omega_I s_I^{(i)}, \Omega_D s_D^{(i)}\big)^2. \tag{6}$$
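The following sketch shows one way to evaluate a coupled log-sparsity measure and the resulting learning objective. It assumes that the coupling is obtained by summing the squared responses of both modalities inside the logarithm and that the objective is the mean of the squared measure over all training pairs. The function names and the toy responses are hypothetical.

```python
import numpy as np

def coupled_sparsity(resp_i, resp_d, nu=10.0):
    """sum_j log(1 + nu * (resp_i_j^2 + resp_d_j^2)), an assumed coupled form.

    Works for single response vectors of shape (k,) and for batches of
    shape (k, M), in which case one value per column is returned.
    """
    return np.sum(np.log1p(nu * (resp_i ** 2 + resp_d ** 2)), axis=0)

def learning_objective(omega_i, omega_d, S_i, S_d, nu=10.0):
    """Mean of the squared coupled sparsity over M training pairs.

    S_i and S_d hold one vectorized training patch per column.
    """
    g = coupled_sparsity(omega_i @ S_i, omega_d @ S_d, nu)  # shape (M,)
    return float(np.mean(g ** 2))

# Aligned zeros are cheaper than misaligned zeros for the same energy.
resp_i = np.array([1.0, 0.0, 0.0, 1.0])
aligned = coupled_sparsity(resp_i, np.array([1.0, 0.0, 0.0, 1.0]))
misaligned = coupled_sparsity(resp_i, np.array([0.0, 1.0, 1.0, 0.0]))
```

Because the logarithm is concave, spreading the same energy over more positions increases the measure, so response pairs whose zeros coincide are preferred.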

Additionally, we take separate constraints on the operator into account, which are motivated in [ ] and
summarized in the following.

The possible solutions for the transpose of a single analysis operator are restricted to the set of full-rank
matrices with normalized columns, known as the oblique manifold $\mathrm{OB}(n, k)$. Since $\mathrm{OB}(n, k)$ is
open and dense in the set of matrices with normalized columns, the penalty function

$$h(\Omega) := -\frac{1}{n \log(n)} \log\det\Big(\frac{1}{k} \Omega^\top \Omega\Big) \tag{7}$$

is used to adhere to the rank condition and to prevent iterates from approaching the boundary of
$\mathrm{OB}(n, k)$. Furthermore, a penalty function is incorporated that enforces the operators to have
distinctive rows, and which controls the mutual coherence of each operator,

$$r(\Omega) := -\sum_{1 \le i < l \le k} \log\big(1 - (\omega_i^\top \omega_l)^2\big), \tag{8}$$

with $\omega_i$ denoting the transpose of the $i$-th row of $\Omega$.
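The two penalties can be sketched numerically as follows. The normalization factor 1/(n log n) of the rank penalty and all function names are assumptions made for this illustration. Operators are expected to have unit-norm rows and n >= 2.

```python
import numpy as np

def normalize_rows(omega):
    """Scale every row of omega to unit Euclidean norm."""
    return omega / np.linalg.norm(omega, axis=1, keepdims=True)

def rank_penalty(omega):
    """-1 / (n log n) * log det(omega^T omega / k); infinite if rank deficient."""
    k, n = omega.shape
    _, logdet = np.linalg.slogdet(omega.T @ omega / k)
    return float(-logdet / (n * np.log(n)))

def coherence_penalty(omega):
    """-sum_{i<l} log(1 - (w_i^T w_l)^2) over all pairs of distinct rows."""
    gram = omega @ omega.T
    upper = np.triu_indices(omega.shape[0], k=1)
    return float(-np.sum(np.log(1.0 - gram[upper] ** 2)))

rng = np.random.default_rng(0)
omega = normalize_rows(rng.standard_normal((50, 25)))  # k = 50, n = 25
h_val = rank_penalty(omega)
r_val = coherence_penalty(omega)
```

With unit-norm rows, the rank penalty attains its smallest value of one for a tight frame and grows without bound as the operator approaches rank deficiency. The coherence penalty is zero for mutually orthogonal rows and grows as two rows become collinear.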

Combining the two penalties into $p(\Omega) := \kappa h(\Omega) + \mu r(\Omega)$, with $\kappa, \mu \in \mathbb{R}^+$
being positive weights, and using $n_1 = n_2 =: n$ for legibility reasons, our problem of learning the pair of JID
analysis operators is given by

$$(\Omega_I^\top, \Omega_D^\top) \in \arg\min_{X_I, X_D \in \mathrm{OB}(n, k)} G(X_I^\top, X_D^\top) + p(X_I^\top) + p(X_D^\top). \tag{9}$$

The arising optimization problem is solved with a geometric CG method using an Armijo step size rule, cf. [ ].

For the evaluation of our approach we train one fixed operator pair and use it in all presented experiments.
To that end, we gather a total of $M = 15000$ pairs of square sample patches of size $\sqrt{n} = 5$
from the five registered intensity and depth image pairs 'Baby1', 'Bowling1', 'Moebius', 'Reindeer' and
'Sawtooth' of the Middlebury stereo set. As is common in dictionary learning methods, we require all
training patches to have zero mean. Furthermore, we learn the operators with twofold redundancy, i.e.
$k = 2n$, resulting in the operator pair $(\Omega_I, \Omega_D) \in \mathbb{R}^{50 \times 25} \times \mathbb{R}^{50 \times 25}$.
In general, a larger redundancy of the operators leads to better reconstruction quality but at the cost of
increased computational complexity of both learning and reconstruction. Twofold redundancy provides a good
trade-off between reconstruction
quality and computation time. We empirically set the remaining parameters to $\nu = 10$,
$\kappa = 9 \cdot 10^{[\ ]}$ and $\mu = 10^{[\ ]}$.
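Gathering aligned, zero-mean training patches can be sketched as follows. The uniform random sampling of patch positions, the function name `sample_patch_pairs`, and the random arrays standing in for a registered image pair are assumptions for illustration. The text above only fixes the patch size and the zero-mean requirement.

```python
import numpy as np

def sample_patch_pairs(intensity, depth, num, size=5, seed=0):
    """Draw `num` aligned patches from a registered intensity/depth pair.

    Returns two arrays of shape (size * size, num), one vectorized patch per
    column, with the mean of every patch removed.
    """
    assert intensity.shape == depth.shape
    rng = np.random.default_rng(seed)
    h, w = intensity.shape
    rows = rng.integers(0, h - size + 1, size=num)
    cols = rng.integers(0, w - size + 1, size=num)
    P_i = np.empty((size * size, num))
    P_d = np.empty((size * size, num))
    for j, (r, c) in enumerate(zip(rows, cols)):
        # Same window in both modalities keeps the pair aligned.
        P_i[:, j] = intensity[r:r + size, c:c + size].ravel()
        P_d[:, j] = depth[r:r + size, c:c + size].ravel()
    P_i -= P_i.mean(axis=0, keepdims=True)
    P_d -= P_d.mean(axis=0, keepdims=True)
    return P_i, P_d

# Random arrays stand in for one registered image pair.
rng = np.random.default_rng(1)
intensity = rng.uniform(0.0, 255.0, size=(40, 60))
depth = rng.uniform(0.0, 255.0, size=(40, 60))
P_i, P_d = sample_patch_pairs(intensity, depth, num=100)
```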

3.3 Depth Map Super-Resolution 

In this section, we explain how the pair of patch-based bimodal analysis operators $(\Omega_I, \Omega_D)$ is used to
jointly reconstruct an aligned pair of intensity and depth signals $s_I, s_D \in \mathbb{R}^N$ from a set of
measurements $y_I \in \mathbb{R}^{m_I}$, $y_D \in \mathbb{R}^{m_D}$. Here, $s_I$ and $s_D$ are the vectorized
versions of an HR intensity image and an HR depth map obtained by ordering their entries lexicographically, with
$N = wh$, where $w$ and $h$ denote the width and height of both HR signals.

To use our bimodal operator for reconstructing entire images or depth maps, we need to extend the
application of the operator beyond local patches. To achieve this, we recall the approach in [ ] for the
unimodal case. Instead of reconstructing each patch individually and combining them in a final step to form
the image, the complete $N$-dimensional signal is reconstructed by minimizing the average sparsity of all
patches. In this way, neighboring patches support each other during the optimization process. Accordingly,
a global analysis operator $\Omega^F \in \mathbb{R}^{K \times N}$ is constructed from a patch-based operator
$\Omega \in \mathbb{R}^{k \times n}$. To this end, let $P_{rc} \in \mathbb{R}^{n \times N}$ denote the operator
that selects the $(\sqrt{n} \times \sqrt{n})$-dimensional patch centered at position $(r, c)$ from the signal.
Then the global operator is given as

$$\Omega^F := \begin{bmatrix} \Omega P_{11} \\ \Omega P_{21} \\ \vdots \\ \Omega P_{hw} \end{bmatrix} \in \mathbb{R}^{K \times N}, \tag{10}$$

with $K = whk$, i.e. all patch positions are considered. The reflective boundary condition is used to deal
with problems along boundaries.
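Applying the global operator does not require building the large matrix explicitly. The sketch below applies a patch-based operator at every pixel position of a padded image. The function name, the use of NumPy's "reflect" padding mode as the reflective boundary condition, and the ordering of the stacked responses are assumptions for illustration.

```python
import numpy as np

def apply_global_operator(omega, image):
    """Apply a patch-based operator (k x n) at every pixel of an h x w image.

    Returns a vector of length K = h * w * k that stacks the responses of
    all patches, each patch being centered at one pixel position.
    """
    k, n = omega.shape
    p = int(round(np.sqrt(n)))
    assert p * p == n and p % 2 == 1, "expects square patches of odd side length"
    radius = p // 2
    padded = np.pad(image, radius, mode="reflect")  # assumed boundary handling
    h, w = image.shape
    out = np.empty((h, w, k))
    for r in range(h):
        for c in range(w):
            patch = padded[r:r + p, c:c + p].ravel()
            out[r, c] = omega @ patch
    return out.reshape(-1)

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(6, 7))
omega = rng.standard_normal((50, 25))
responses = apply_global_operator(omega, image)  # length 6 * 7 * 50
```

Since every row of the patch operator acts like a small filter, the same result could be obtained with k two-dimensional correlations, which is the form that lends itself to parallel implementations.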

Now, with the global operator pair $(\Omega_I^F, \Omega_D^F)$, the bimodal extension of the signal reconstruction
in (2) is given by

$$(s_I^\star, s_D^\star) \in \arg\min_{s_I, s_D \in \mathbb{R}^N} g(\Omega_I^F s_I, \Omega_D^F s_D) \quad \text{subject to} \quad d_E\big((A_I s_I, A_D s_D), (y_I, y_D)\big) \le \epsilon. \tag{11}$$

Therein, the sparsity measure $g$ is the same as the one in Equation (5). Consequently, the analyzed versions
of both modalities are enforced to have a correlated co-support and hence the two signals are coupled.

The measurement matrices $A_I \in \mathbb{R}^{m_I \times N}$ and $A_D \in \mathbb{R}^{m_D \times N}$ model the
sampling process of each modality. Here, we focus on enhancing the quality of depth measurements $y_D$, given a
fixed high-quality intensity signal $y_I = s_I$, by simultaneously upsampling and inpainting missing measurements.
In this case, $A_I$ is the identity operator and the analyzed intensity signal is constant, i.e.
$\Omega_I^F s_I = c = \text{const}$. This simplifies Problem (11) for recovering an HR depth map to

$$s_D^\star \in \arg\min_{s_D \in \mathbb{R}^N} g(c, \Omega_D^F s_D) \quad \text{subject to} \quad d_E(A_D s_D, y_D) \le \epsilon_D. \tag{12}$$

The data fidelity term $d_E$ depends on the error model of the depth data and can be chosen accordingly.
For instance, this may be an error measure tailored to a sensor-specific model, cf. Section 4.2. In this
way, knowledge about the scene gained from the intensity image and its co-support regarding the bimodal
analysis operators helps to determine the HR depth signal.

4 Results and Comparison 

In this section we experimentally evaluate our approach by conducting two sets of experiments. First, we 
evaluate our approach numerically on synthetic data using the well-known Middlebury stereo dataset [ ], 
which provides aligned intensity images and depth maps for a number of different test scenes. Second, we 
evaluate our method on real-world data by processing scenes captured with the popular Microsoft Kinect 
sensor. 

4.1 Quantitative Evaluation 

To compare our results to the state of the art, we quantitatively evaluate our algorithm on the four standard
test images 'Tsukuba', 'Venus', 'Teddy', and 'Cones' from the Middlebury dataset. To artificially create LR
input depth maps, we scale the ground truth depth maps down by a factor of $d$ in both vertical and horizontal
dimension. We first blur the available HR image with a Gaussian kernel of size $(2d - 1) \times (2d - 1)$ and
standard deviation $\sigma = d/3$ before downsampling. The LR depth map and the corresponding HR intensity
image are the input to our algorithm.
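The creation of the LR test inputs can be sketched as follows. The kernel size and standard deviation follow the description above, whereas the reflective padding, the sampling phase (starting at the first pixel), and the function names are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(d):
    """Normalized Gaussian kernel of size (2d - 1) x (2d - 1), sigma = d / 3."""
    size = 2 * d - 1
    sigma = d / 3.0
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-0.5 * (ax / sigma) ** 2)
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def make_low_res(depth_hr, d):
    """Blur an HR depth map with the Gaussian kernel, then keep every d-th pixel."""
    kernel = gaussian_kernel(d)
    radius = kernel.shape[0] // 2
    padded = np.pad(depth_hr, radius, mode="reflect")  # assumed boundary handling
    h, w = depth_hr.shape
    blurred = np.zeros((h, w))
    # Direct filtering: accumulate shifted copies weighted by the kernel.
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            blurred += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return blurred[::d, ::d]

depth_hr = np.full((8, 8), 5.0)
depth_lr = make_low_res(depth_hr, d=2)  # shape (4, 4)
```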

Here, we assume an i.i.d. normal distribution of the error, which leads to the data fidelity term
$d_E(A_D s_D, y_D) = \|A_D s_D - y_D\|_2^2$. From (12) we get the unconstrained optimization problem for
reconstructing the HR depth signal as

$$s_D^\star \in \arg\min_{s_D \in \mathbb{R}^N} \lambda\, g(c, \Omega_D^F s_D) + \|A_D s_D - y_D\|_2^2. \tag{13}$$

Larger values of the weighting factor $\lambda \propto \epsilon_D$ lead to a faster convergence of the algorithm
but may cause larger differences between the measurements and the reconstructed depth map. To achieve the best
results with few iterations, we start with $\lambda = 1$ and restart the conjugate gradient optimization procedure
five times, while consecutively shrinking the multiplier to a final value of $\lambda = 10^{-[\ ]}$.
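The restart strategy with a shrinking multiplier can be illustrated on a one-dimensional toy problem. This sketch is not our implementation: it replaces the conjugate gradient method with plain gradient descent using a fixed step size, assumes the coupled log-measure with squared responses of both modalities, uses a forward-difference operator instead of a learned one, and picks a geometric schedule with a hypothetical final value of 1e-3.

```python
import numpy as np

def gradient(s, A, y, omega, c, lam, nu):
    """Gradient of lam * sum_j log(1 + nu (c_j^2 + u_j^2)) + ||A s - y||^2, u = omega s."""
    u = omega @ s
    psi = 2.0 * nu * u / (1.0 + nu * (c ** 2 + u ** 2))
    return lam * (omega.T @ psi) + 2.0 * A.T @ (A @ s - y)

def reconstruct(A, y, omega, c, lams, nu=10.0, iters=200):
    """Warm-started gradient descent, restarted once per multiplier in `lams`."""
    s = np.zeros(A.shape[1])
    for lam in lams:
        # Upper bound on the Lipschitz constant of the gradient gives a safe step.
        lipschitz = (2.0 * lam * nu * np.linalg.norm(omega, 2) ** 2
                     + 2.0 * np.linalg.norm(A, 2) ** 2)
        for _ in range(iters):
            s = s - gradient(s, A, y, omega, c, lam, nu) / lipschitz
    return s

n = 16
omega = np.eye(n)[1:] - np.eye(n)[:-1]                 # toy analysis operator
s_true = np.where(np.arange(n) < 8, 2.0, 7.0)          # depth with one edge
intensity = np.where(np.arange(n) < 8, 10.0, 50.0)     # intensity edge, same place
A = np.eye(n)[::2]                                     # keep every second sample
y = A @ s_true                                         # noise-free LR measurements
c = omega @ intensity                                  # fixed analyzed intensity
lams = np.geomspace(1.0, 1e-3, num=6)                  # assumed shrinking schedule
s_hat = reconstruct(A, y, omega, c, lams)
```

Early stages with a large multiplier quickly propagate structure into the unmeasured samples, while the final stages with a small multiplier pull the solution back toward the measurements.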

Following the methodology described in the work of comparable depth map SR approaches, we use the
Middlebury stereo matching online evaluation tool¹ to quantitatively assess the accuracy of our results with
respect to the ground truth data. We report the percentage of bad pixels over all pixels in the depth map
with an error threshold of $\delta = 1$. Additionally, we provide the root-mean-square error (RMSE) based on
8-bit images. For the numerical comparison in Table 1 and Table 2, we rely on the results reported by the
authors of comparable methods, since their implementations are not publicly available. To show the advantage
of enforcing a coupled co-support in the analysis formulation, we further employed a single modal operator
learned by the code provided by the authors of [ ]. This operator has been learned from the same training
images as described above, and with the parameters documented in their paper.

¹ http://vision.middlebury.edu/stereo/eval/

d    method               Tsukuba    Venus    Teddy    Cones
2x   nearest-neighbor        1.24     0.37     4.97     2.51
     Yang et al. [28]        1.16     0.25     2.43     2.39
     Hawe et al. [ ]         1.03     0.22     2.95     3.56
     our method              0.47     0.09     1.41     1.81
4x   nearest-neighbor        3.53     0.81     6.71     5.44
     Yang et al.             2.56     0.42     5.95     4.76
     Hawe et al.             2.95     0.65     4.80     6.54
     our method              1.73     0.25     3.54     5.16
8x   nearest-neighbor        3.56     1.90     10.9     10.4
     Yang et al.             6.95     1.19    11.50    11.00
     Lu et al. [15]          5.09     1.00     9.87    11.30
     Hawe et al.             5.59     1.24    11.40    12.30
     our method              3.53     0.33     6.49     9.22

Table 1: Numerical comparison of our method to other depth map SR approaches for different upscaling
factors d. The figures represent the percentage of bad pixels with respect to all pixels of the ground truth
data and an error threshold of $\delta = 1$.

d    method               Tsukuba    Venus    Teddy    Cones
2x   nearest-neighbor       0.612    0.288    1.543    1.531
     Chan et al. [ ]          n/a    0.216    1.023    1.353
     Aodha et al. [2]       0.601    0.296    0.977    1.227
     Hawe et al. [ ]        0.278    0.105    0.996    0.939
     our method             0.255    0.075    0.702    0.680
4x   nearest-neighbor       1.189    0.408    1.943    2.470
     Chan et al.              n/a    0.273    1.125    1.450
     Aodha et al.           0.833    0.395    1.184    1.779
     Hawe et al.            0.450    0.179    1.389    1.398
     our method             0.487    0.129    1.347    1.383
8x   nearest-neighbor       1.135    0.546    2.614    3.260
     Chan et al.              n/a    0.369    1.410    1.635
     Hawe et al.            0.713    0.249    1.743    1.883
     our method             0.753    0.156    1.662    1.871

Table 2: Numerical comparison of our method to other depth map SR approaches. The figures represent
the RMSE in comparison with the ground truth depth map.
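The two error measures reported in Table 1 and Table 2 can be computed as in the following sketch. The strict inequality for bad pixels and the function names are assumptions; the online evaluation tool may differ in such details.

```python
import numpy as np

def bad_pixel_percentage(estimate, ground_truth, delta=1.0):
    """Percentage of pixels whose absolute error exceeds the threshold delta."""
    err = np.abs(estimate.astype(np.float64) - ground_truth.astype(np.float64))
    return float(100.0 * np.mean(err > delta))

def rmse(estimate, ground_truth):
    """Root-mean-square error between two depth maps of equal shape."""
    diff = estimate.astype(np.float64) - ground_truth.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

ground_truth = np.zeros((2, 2))
estimate = np.array([[0.0, 2.0], [0.5, 3.0]])
bad = bad_pixel_percentage(estimate, ground_truth)   # two of four pixels are bad
error = rmse(estimate, ground_truth)
```

Casting to floating point before subtracting avoids wrap-around when the depth maps are stored as 8-bit unsigned integers.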

As illustrated in Figure 2, our method improves depth map SR considerably over simple interpolation 
approaches. Neither staircasing nor substantial blurring artifacts occur, particularly in areas with discon- 
tinuities. Also, there is no noticeable texture cross-talk in areas of smooth depth and cluttered intensity. 
Edges can be preserved with great detail due to the additional knowledge provided by the intensity image, 
even if SR is conducted using large upscaling factors. The quantitative comparison with other depth map
SR methods shows that our JID analysis operator performs best in the majority of test cases, although
individual competing results are better for some images and upscaling factors (cf. Table 1 and Table 2). It
reaches near-perfect results for small upscaling factors and the improvement over state-of-the-art methods is
of particular significance for larger magnification factors. We refer the reader to the supplementary material
for illustrations of our synthetic test results. 




Figure 2: Visual comparison of different upscaling methods on a detail in the test image Tsukuba from [l2],
which was downsampled by a factor of 8 in both vertical and horizontal direction. Panels: (a) ground truth,
(b) nearest neighbor, (c) bicubic, (d) proposed method.



4.2 Validation on Kinect Data 

In order to demonstrate the applicability of our algorithm to real data, we captured color images of size
1280x960 and corresponding depth maps of size 640x480 using the Microsoft Kinect sensor and then
upscaled the depth maps by a factor of $d = 2$ to match the size of the color images.

Since the approximate error statistics for this application and this sensor have been studied previously in
[13], we can use this information to further refine our data model. According to [ ], the standard deviation
of Kinect depth data is proportional to the square of the depth value, $\sigma_i \propto (y_{D,i})^2$, where
$y_{D,i}$ denotes the $i$-th entry of $y_D$. We utilize this in our error model by employing the squared
Mahalanobis distance for $d_E$ in (12), which yields

$$s_D^\star \in \arg\min_{s_D \in \mathbb{R}^N} \lambda\, g(c, \Omega_D^F s_D) + d_E(A_D s_D, y_D), \tag{14}$$

where $d_E = (A_D s_D - y_D)^\top \Sigma^{-1} (A_D s_D - y_D)$ and $\Sigma \in \mathbb{R}^{m_D \times m_D}$ is a
diagonal matrix with main diagonal elements $(y_{D,i})^2$.

Figure 3: Depth maps (top row), 3D rendering of Kinect color and depth data depicting the entire scene
(middle row) and a detail of the fruit bowl (bottom row). Left column: original Kinect data like in the top
row of Figure 1 with downsampled color information, center column: bicubic interpolation (1280x960),
right column: proposed method (1280x960). Note that object shadows are due to the single view occlusion.

As the Kinect sensor uses structured light to measure depth, the signal is corrupted by missing pixels due
to occlusions arising from the displacement of the IR light source and the sensor. To fill these gaps in the
data, we model the measurement matrix in such a way that it excludes these gaps from the sampling process
of the LR depth image, i.e. we remove the rows of $A_D$ that correspond to zero entries in $y_D$. As a result,
we perform inpainting of missing depth values without any additional processing, while simultaneously
increasing the depth map resolution. In this way, we handle two of the main issues of Kinect data in one step.
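Both ingredients of the Kinect data term, dropping the rows of missing measurements and weighting the residual by a diagonal matrix, can be sketched as follows. The function names are hypothetical, and the diagonal of the weighting matrix is passed in explicitly so that any depth-dependent scaling can be used. The values in the example are made up.

```python
import numpy as np

def drop_missing(A, y):
    """Remove the rows of A (and entries of y) where the measurement is zero."""
    keep = y != 0
    return A[keep], y[keep]

def mahalanobis_sq(A, y, s, sigma_diag):
    """(A s - y)^T Sigma^{-1} (A s - y) for a diagonal Sigma given by sigma_diag."""
    residual = A @ s - y
    return float(np.sum(residual ** 2 / sigma_diag))

# Four measured positions, two of which are missing (encoded as zeros).
A = np.eye(4)
y = np.array([0.0, 2.0, 0.0, 4.0])
A_valid, y_valid = drop_missing(A, y)

s = np.array([1.0, 2.0, 3.0, 5.0])       # candidate depth signal
sigma_diag = y_valid ** 2                # one possible depth-dependent weighting
distance = mahalanobis_sq(A_valid, y_valid, s, sigma_diag)
```

Because the missing positions no longer appear in the data term, their values are determined entirely by the prior, which is what yields inpainting without any additional processing step.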

To our knowledge, there is no publicly available data set that provides ground truth data for numerically
evaluating Kinect depth map enhancement methods. Therefore, we assess the quality of the
super-resolved Kinect depth maps visually. Since small differences in the depth map represented as a gray-scale
image are almost invisible to the naked eye, we illustrate our results in Figure 3 using ball pivoting
surface reconstruction [ ] on a point cloud that we created from the depth map computed by our algorithm.
As can be seen, our method not only increases the details in the 3D scene significantly, but also treats
the missing pixels with great success. This is especially obvious in the details of the tiger head in Figure
1 and the fruit bowl in Figure 3. The 3D rendering illustrates the impact of the bimodal support during
reconstruction particularly around depth discontinuities, but it also leads to smoother surfaces of table and
wall due to the smooth texture of the corresponding intensity signal.

We would like to emphasize that we use the same JID analysis operators as in the Middlebury experi- 
ments in Section 4.1, even though the training data was captured using a different sensor technology than 
the Kinect. This underpins that the prior model we learn is general enough to be used for high quality 
reconstruction of both synthetic and real world data. 

5 Conclusion and Discussion 

We proposed an approach for inferring high-resolution depth maps from low-resolution depth samples given
an additional high-resolution intensity image of the scene. We presented an extension of the co-sparse analysis
model to the bimodal case. The required pair of analysis operators is learned jointly such that the co-sparse
representations of a pair of corresponding intensity and depth samples have a correlated co-support. This
data model is employed for depth map super-resolution and yields improved results on the benchmark data
set over state-of-the-art methods. Moreover, it greatly improves real-world depth data recorded by a Kinect
sensor. The fact that the same pre-trained operators can be used to refine both synthetic and real-world
depth maps underpins the validity of the model assumptions and emphasizes the capability of this method
to abstract training data appropriately.

Despite these compelling results, our method certainly has a few limitations. We showed that missing
pixels can be recovered very successfully with our approach. However, the local assumptions fail if the
missing areas in the input signal are too large. As a result, inpainting of such large gaps may be inaccurate
if the global support in our reconstruction model is insufficient to overcome this. For instance, this can
be observed in the frame of missing pixels around the depth map in Figure 3, which is due to registering
intensity and depth inputs. Finally, in our current implementation, reconstructing an HR depth image with
500 iterations takes up to three minutes on a single 3.2 GHz CPU with unoptimized Matlab code. Since
most of the processing time is dedicated to parallelizable filtering operations, we expect to improve on this
with a better software implementation and processing on a GPU. Furthermore, the number of iterations in
the reconstruction may be reduced significantly. As shown in Figure 4, the last 400 iterations only reduce
the RMSE by about 0.2% and very decent recovery results are achieved with only 50 optimization steps.



Figure 4: Plot of the relative RMSE over the optimization iterations for the upscaling of the synthetic test
images by a factor of 8. Axes: iterations (horizontal), relative RMSE (vertical). Curves: Tsukuba, Venus,
Teddy, Cones.